116        Bioinformatics

bcftools mpileup -Ou \

-f reference_file \

-b bam_list_file \

| bcftools call -vmO z \

-o sarscov2.vcf.gz

The “bcftools mpileup” is the command used for the pileup. The “-f” option specifies the

reference file used by the aligner to produce the BAM file, “-b” option specifies the BAM

file from which we wish to call variants. The above form allows both pileup and variant

calling.

In the following, we will discuss an example of variant calling pipeline using “bcftools”.

We will use SARS-CoV-2 raw sequence data from the NCBI SRA database to demonstrate

variant calling.

SARS-CoV-2 is the virus that causes COVID-19 which affected millions of people

around the world and caused thousands of deaths. The virus mutates in a short period of

time, producing new strains. The following are the NCBI SRA run unique identifiers of raw

data generated by whole genome sequencing of SARS-CoV-2:

SRR16989486

SRR16537313

SRR16537315

SRR16537317

In the following, we wish to identify variants (SNVs and InDels) from short reads of those

four samples and report that in a VCF file.

The programs used with this example include SRA toolkits, wget, gzip, samtools, bwa,

and bcftool on Linux.

The workflow for variant discovery includes: (i) acquiring the raw data (FASTQ files), (ii)

quality control if required, (iii) downloading and indexing the reference genome, (iv) using

an aligner to map reads in FASTQ files to a reference genome to generate a BAM file, (v)

sorting the aligned reads in BAM files, (vi) removal of duplicate reads from the BAM files,

and (vii) performing variant calling and generation of a single VCF file for genotyping of all

samples. Any new generated BAM files must be indexed before proceeding to the next step.

In the following, we will discuss each of these steps. First, we need to open the Linux ter-

minal, create a directory called “sarscov2” for this project, and make it the current working

directory.

mkdir sarscov2

cd sarscov2

4.2.1.1.1  Acquiring the raw data

The raw data files are stored in the NCBI SRA database with above SRA run IDs. The

FASTQ files are in sequence read archive (SRA) format that requires SRA toolkits to

be downloaded and extracted. For that purpose, we can use “fasterq-dump” with the

­following syntax: